The purpose of this notebook is twofold. First, it demonstrates the basic functionality of PyLogit for estimating nested logit models. Secondly, it compares the nested logit capabilities of PyLogit with Python Biogeme. The dataset used is the SwissMetro dataset from http://biogeme.epfl.ch/examples_swissmetro.html. For an explanation of the variables in the dataset, see http://www.strc.ch/conferences/2001/bierlaire1.pdf


In [1]:
from collections import OrderedDict    # For recording the model specification 

import pandas as pd                    # For file input/output
import numpy as np                     # For vectorized math operations
import statsmodels.tools.numdiff as numdiff       # For numeric hessian
import scipy.linalg                    # For matrix inversion

import pylogit as pl                   # For choice model estimation
from pylogit import nested_logit as nl # For nested logit convenience funcs

1. Load the Swissmetro Dataset


In [2]:
# Load the raw swiss metro data
# Note the .dat files are tab delimited text files
swissmetro_wide = pd.read_table("../data/swissmetro.dat", sep='\t')

2. Clean the dataset

Note that the 09NestedLogit.py file provided is an example from Python Biogeme (see: http://biogeme.epfl.ch/examples_swissmetro.html). The 09NestedLogit.py file excludes observations meeting the following critera:

exclude = (( PURPOSE != 1 ) * (  PURPOSE   !=  3  ) + ( CHOICE == 0 )) > 0
As a result, their dataset has 6,768 observations. Below, I make the same exclusions.


In [3]:
# Select obervations whose choice is known (i.e. CHOICE != 0)
# **AND** whose PURPOSE is either 1 or 3
include_criteria = (swissmetro_wide.PURPOSE.isin([1, 3]) &
                    (swissmetro_wide.CHOICE != 0))

# Use ".copy()" so that later on, we avoid performing operations 
# on a view of a dataframe as opposed to on an actual dataframe
clean_sm_wide = swissmetro_wide.loc[include_criteria].copy()

# Look at how many observations we have after removing unwanted
# observations
final_num_obs = clean_sm_wide.shape[0]
num_obs_statement = "The cleaned number of observations is {:,.0f}."
print (num_obs_statement.format(final_num_obs))


The cleaned number of observations is 6,768.

3. Create an id column that ignores the repeat observations per individual

In the simple example given on the Python Biogeme website for 09NestedLogit.py, the repeated observations per individual are treated as separate and independent observations. We will do the same.


In [4]:
# Create a custom id column that ignores the fact that this is a 
# panel/repeated-observations dataset, and start the "custom_id" from 1
clean_sm_wide["custom_id"] = np.arange(clean_sm_wide.shape[0], dtype=int) + 1

4. Convert the data from 'wide' format to 'long' format

4a. Determine the 'type' of each column in the dataset.


In [5]:
# Look at the columns of the swissmetro data
clean_sm_wide.columns


Out[5]:
Index([u'GROUP', u'SURVEY', u'SP', u'ID', u'PURPOSE', u'FIRST', u'TICKET',
       u'WHO', u'LUGGAGE', u'AGE', u'MALE', u'INCOME', u'GA', u'ORIGIN',
       u'DEST', u'TRAIN_AV', u'CAR_AV', u'SM_AV', u'TRAIN_TT', u'TRAIN_CO',
       u'TRAIN_HE', u'SM_TT', u'SM_CO', u'SM_HE', u'SM_SEATS', u'CAR_TT',
       u'CAR_CO', u'CHOICE', u'custom_id'],
      dtype='object')

In [6]:
# Create the list of individual specific variables
ind_variables = clean_sm_wide.columns.tolist()[:15]

# Specify the variables that vary across individuals **AND** 
# across some or all alternatives
alt_varying_variables = {u'travel_time': dict([(1, 'TRAIN_TT'),
                                               (2, 'SM_TT'),
                                               (3, 'CAR_TT')]),
                          u'travel_cost': dict([(1, 'TRAIN_CO'),
                                                (2, 'SM_CO'),
                                                (3, 'CAR_CO')]),
                          u'headway': dict([(1, 'TRAIN_HE'),
                                            (2, 'SM_HE')]),
                          u'seat_configuration': dict([(2, "SM_SEATS")])}

# Specify the availability variables
availability_variables = dict(zip(range(1, 4), ['TRAIN_AV', 'SM_AV', 'CAR_AV']))

# Determine the columns that will denote the
# new column of alternative ids, and the columns
# that denote the custom observation ids and the 
# choice column
new_alt_id = "mode_id"
obs_id_column = "custom_id"
choice_column = "CHOICE"

4b. Actually perform the conversion from wide to long formats


In [7]:
# Perform the desired conversion
long_swiss_metro = pl.convert_wide_to_long(clean_sm_wide, 
                                           ind_variables, 
                                           alt_varying_variables, 
                                           availability_variables, 
                                           obs_id_column, 
                                           choice_column,
                                           new_alt_id_name=new_alt_id)

# Look at the first 9 rows of the long-format dataframe
long_swiss_metro.head(9).T


Out[7]:
0 1 2 3 4 5 6 7 8
custom_id 1 1 1 2 2 2 3 3 3
mode_id 1 2 3 1 2 3 1 2 3
CHOICE 0 1 0 0 1 0 0 1 0
GROUP 2 2 2 2 2 2 2 2 2
SURVEY 0 0 0 0 0 0 0 0 0
SP 1 1 1 1 1 1 1 1 1
ID 1 1 1 1 1 1 1 1 1
PURPOSE 1 1 1 1 1 1 1 1 1
FIRST 0 0 0 0 0 0 0 0 0
TICKET 1 1 1 1 1 1 1 1 1
WHO 1 1 1 1 1 1 1 1 1
LUGGAGE 0 0 0 0 0 0 0 0 0
AGE 3 3 3 3 3 3 3 3 3
MALE 0 0 0 0 0 0 0 0 0
INCOME 2 2 2 2 2 2 2 2 2
GA 0 0 0 0 0 0 0 0 0
ORIGIN 2 2 2 2 2 2 2 2 2
DEST 1 1 1 1 1 1 1 1 1
seat_configuration 0 0 0 0 0 0 0 0 0
travel_time 112 63 117 103 60 117 130 67 117
headway 120 20 0 30 10 0 60 30 0
travel_cost 48 52 65 48 49 84 48 58 52

5. Create the variables used in the Python Biogeme Nested Logit Model Example

In 09NestedLogit.py, the travel time and travel cost variables are scaled for ease of numeric optimization. We will do the same such that our estimated coefficients are comparable.


In [8]:
# Scale both the travel time and travel cost by 100
long_swiss_metro["travel_time_hundredth"] = (long_swiss_metro["travel_time"] /
                                             100.0)

# Figure out which rows correspond to train or swiss metro 
# alternatives for individuals with GA passes. These individuals face no 
# marginal costs for a trip
train_pass_train_alt = ((long_swiss_metro["GA"] == 1) *
                        (long_swiss_metro["mode_id"].isin([1, 2]))).astype(int)
# Note that the (train_pass_train_alt == 0) term accounts for the
# fact that those with a GA pass have no marginal cost for the trip
long_swiss_metro["travel_cost_hundredth"] = (long_swiss_metro["travel_cost"] *
                                             (train_pass_train_alt == 0) /
                                             100.0)


/Users/timothyb0912/anaconda/lib/python2.7/site-packages/pandas/computation/expressions.py:190: UserWarning: evaluating in Python space because the '*' operator is not supported by numexpr for the bool dtype, use '&' instead
  unsupported[op_str]))

6. Specify and Estimate the Python Biogeme Nested Logit Model Example

6a. Specify the Model


In [9]:
# Specify the nesting values
nest_membership = OrderedDict()
nest_membership["Future Modes"] = [2]
nest_membership["Existing Modes"] = [1, 3]

In [10]:
# Create the model's specification dictionary and variable names dictionary
# NOTE: - Keys should be variables within the long format dataframe.
#         The sole exception to this is the "intercept" key.
#       - For the specification dictionary, the values should be lists
#         or lists of lists. Within a list, or within the inner-most
#         list should be the alternative ID's of the alternative whose
#         utility specification the explanatory variable is entering.

example_specification = OrderedDict()
example_names = OrderedDict()

# Note that 1 is the id for the Train and 3 is the id for the Car.
# The next two lines are placing alternative specific constants in
# the utility equations for the Train and for the Car. The order
# in which these variables are placed is chosen so the summary
# dataframe which is returned will match that shown in the HTML
# file of the python biogeme example.
example_specification["intercept"] = [3, 1]
example_names["intercept"] = ['ASC Car', 'ASC Train']

# Note that the names used below are simply for consistency with
# the coefficient names given in the Python Biogeme example.
# example_specification["travel_cost_hundredth"] = [[1, 2, 3]]
# example_names["travel_cost_hundredth"] = ['B_COST']

example_specification["travel_cost_hundredth"] = [[1, 2, 3]]
example_names["travel_cost_hundredth"] = ['B_COST']

example_specification["travel_time_hundredth"] = [[1, 2, 3]]
example_names["travel_time_hundredth"] = ['B_TIME']

6b. Estimate the model

One main difference between the nested logit implementation in PyLogit and in Python Biogeme or mLogit in R is that PyLogit reparameterizes the 'standard' nested logit model. In particular, one standard reperesntation of the nested logit model is in terms of the inverse of the 'scale' parameter for each nest (see for example the representation given by Kenneth Train in section 4.2 here). The 'scale' parameter has domain from zero to infinity, therefore the inverse of the scale parameter has the same domain.

However, for econometric purposes (such as conforming to the assumptions that individuals are making choices through a utility maximizing decision protocol), the scale parameter of a 'lower level nest' is constrained to be greater than or equal to 1 (assuming that the 'upper level nest' is constrained to 1.0 for identification purposes). The inverse of the scale parameter would then be constrained to be between 0.0 and 1.0 in this case. In order to make use of unconstrained optimization algorithms, we therefore estimate the logit ( i.e. $\ln \left[ \frac{\textrm{scale}^{-1}}{1.0 - \textrm{scale}^{-1}} \right]$) of the inverse of the scale parameter, assuming that the inverse of the scale parameter will lie between zero and one (and accordingly that the scale parameter be greater than or equal to one).


In [11]:
# Define a function that calculates the "logit" transformation of values
# between 0.0 and 1.0.
def logit(x):
    """
    Parameters
    ----------
    x : int, float, or 1D ndarray.
        If an array, all elements should be ints or floats. All
        elements should be between zero and one, exclusive of 1.0.

    Returns
    -------
    The logit of x:  `np.log(x / (1.0 - x))`.
    """
    return np.log(x/(1.0 - x))

In [12]:
# Provide the module with the needed input arguments to create
# an instance of the MNL model class
example_nested = pl.create_choice_model(data=long_swiss_metro,
                                        alt_id_col=new_alt_id,
                                        obs_id_col=obs_id_column,
                                        choice_col=choice_column,
                                        specification=example_specification,
                                        model_type="Nested Logit",
                                        names=example_names,
                                        nest_spec=nest_membership)

# Specify the initial nesting parameter values
# Note: This should be in terms of the reparameterized values used
# by PyLogit.

# Note: The '40' corresponds to scale parameter that is numerically
# indistinguishable from 1.0

# Note: 2.05 is the scale parameter that is estimated by PythonBiogeme
# so we invert it, then take the logit of this inverse to get the
# corresponding starting value to be used by PyLogit.
# Note the first value corresponds to the first nest in 'nest_spec'
# and the second value corresponds to the second nest in 'nest_spec'.
init_nests = np.array([40, logit(2.05**-1)])

# Specify the initial index coefficients used by PythonBiogeme
init_coefs = np.array([-0.167, -0.512, -0.899, -0.857])

# Create a single array of the initial values
init_values = np.concatenate((init_nests, init_coefs), axis=0)

# Start the model estimation from the pythonbiogeme initial values
# Note that the first value, in the initial values, is constrained
# to remain constant through the estimation process. This is because
# the first nest in nest_spec is a 'degenerate' nest with only one
# alternative, and the nest parameter of degenerate nests is not
# identified.
example_nested.fit_mle(init_values,
                       constrained_pos=[0])


Log-likelihood at zero: -6,964.6630
Initial Log-likelihood: -5,239.0782
Estimation Time: 0.18 seconds.
Final log-likelihood: -5,236.9000

Also, note that the functionality of using parameter constraints is restriced to the Mixed Logit and Nested Logit models at the moment. Moreover, this functionality is only relevant when using optimization method that make use of gradient information. Gradient-free estimation methods such as 'powell's' method or 'nelder-mead' will not make use of the constrained_pos keyword argument.

6.c Compare the model output with that of Python Biogeme


In [13]:
# Look at the estimated coefficients and goodness-of-fit statistics
example_nested.get_statsmodels_summary()


Out[13]:
Nested Logit Model Regression Results
Dep. Variable: CHOICE No. Observations: 6,768
Model: Nested Logit Model Df Residuals: 6,762
Method: MLE Df Model: 6
Date: Tue, 30 Aug 2016 Pseudo R-squ.: 0.248
Time: 15:20:48 Pseudo R-bar-squ.: 0.247
converged: True Log-Likelihood: -5,236.900
LL-Null: -6,964.663
coef std err z P>|z| [95.0% Conf. Int.]
Future Modes Nest Param 40.0000 nan nan nan nan nan
Existing Modes Nest Param -0.0527 0.020 -2.584 0.010 -0.093 -0.013
ASC Car -0.1672 0.032 -5.243 0.000 -0.230 -0.105
ASC Train -0.5119 0.035 -14.781 0.000 -0.580 -0.444
B_COST -0.8567 0.036 -23.578 0.000 -0.928 -0.785
B_TIME -0.8987 0.034 -26.228 0.000 -0.966 -0.832

Compare with PythonBiogeme


In [14]:
# Note that the Mu (i.e the scale parameter) estimated by python biogeme is 
# 1.0 / nest_coefficient where
# nest_coefficient = 1.0 / (1.0 + exp[-1 * estimated_nest_param])
pylogit_mu = 1.0 + np.exp(-1 * example_nested.params["Existing Modes Nest Param"])
print "PyLogit's estimated Mu is: {:,.4f}".format(pylogit_mu)


PyLogit's estimated Mu is: 2.0541

Summary

My parameter estimates match those of Python Biogeme.
The Python Biogeme log-likelihood is -5,236.900 and their estimated parameters are:

ASC Car:    -0.167
ASC Train:  -0.512
B_COST:     -0.857
B_TIME:     -0.899
Mu:          2.05

As shown above, my log-likelihood is -5,236.900, and my estimated parameters are:

ASC Car:                   -0.1672
ASC Train:                 -0.5119
B_COST:                    -0.8567
B_TIME:                    -0.8987
Existing Modes Nest Param:  2.0541

PyLogit's covariance estimates for the Nested Logit model are currently based on the BHHH approximation to the Fisher Information Matrix. This is the same procedure used by mlogit. However, based on the disaggreement between PyLogit's standard errors and those of Python Biogeme, Python Biogeme is clearly not using the BHHH approximation to the Fisher Information Matrix to calculate its standard errors. How does Python Biogeme calculate its standard errors?

Investigate the use of numeric approximations to the Hessian


In [15]:
# Create objects for all of the necessary arguments that are
# needed to compute the log-likelihood of the nested logit model
# given the data used in this example
nested_design = example_nested.design
mapping_res = example_nested.get_mappings_for_fit()
choice_array = long_swiss_metro["CHOICE"].values

# Create a 'convenience' function that simply returns the log-likelihood
# given a vector of coefficients
def convenient_log_likelihood(all_coefs):
    log_likelihood = nl.convenient_nested_log_likelihood(all_coefs,
                                                         nested_design,
                                                         mapping_res["rows_to_obs"],
                                                         mapping_res["rows_to_nests"],
                                                         choice_array)
    return log_likelihood

# Calculate the numeric hessian
numeric_hess = numdiff.approx_hess(example_nested.params.values,
                                   convenient_log_likelihood)
# Account for the fact that the first param is constrained
numeric_hess[0, :] = 0
numeric_hess[:, 0] = 0
numeric_hess[0, 0] = -1
# Calculate the asymptotic covariance with the numeric hessian
numeric_cov = -1 * scipy.linalg.inv(numeric_hess)
# Get the numeric standard errors
numeric_std_errs = pd.Series(np.sqrt(np.diag(numeric_cov)),
                             index=example_nested.params.index)
# Make sure the Future Modes Nest param has a standard error of np.nan
numeric_std_errs.loc["Future Modes Nest Param"] = np.nan
# Order the numeric standard errors according to the Python Biogeme
# output
numeric_std_errs = pd.concat([numeric_std_errs[example_nested.params.index[2:]],
                              numeric_std_errs[example_nested.params.index[:2]]],
                             axis=0)
# Display the numeric standard errors
numeric_std_errs


Out[15]:
ASC Car                      0.037136
ASC Train                    0.045180
B_COST                       0.046273
B_TIME                       0.056991
Future Modes Nest Param           NaN
Existing Modes Nest Param    0.111669
dtype: float64

Python Biogeme Output

Name       Value    Std err t-test  p-value
ASC_CAR    -0.167   0.0371   -4.50  0.00
ASC_TRAIN     -0.512    0.0452  -11.33  0.00
B_COST      -0.857  0.0463  -18.51  0.00
B_TIME      -0.899  0.0570  -15.77  0.00
MU           2.05    0.118    17.45 0.00

From above, we see that for the index coefficients, the standard errors that are calculated using the numeric approximation of the hessian match the standard errors returned by Python Biogeme. This suggests that the standard errors of Python Biogeme, for the nested logit model, are based on a numeric differentiation approximation to the Hessian.

Below, we investigate whether the numeric approximation of the gradient via numeric differentiation is a close approximation to the analytic gradient. The premise is that if the numeric gradient does not adequately approximate the analytic gradient, then what chance does the numeric hessian have of adequately approximating the analytic hessian?


In [16]:
# Approximate the gradient using numeric differentiation
numeric_grad = numdiff.approx_fprime(example_nested.params.values,
                                     convenient_log_likelihood)
pd.DataFrame([numeric_grad,
              example_nested.gradient.values],
             index=["Numeric Differentiation", "Analytic"],
             columns=example_nested.params.index).T


Out[16]:
Numeric Differentiation Analytic
Future Modes Nest Param 0.000000 0.000000e+00
Existing Modes Nest Param 0.000610 -5.474989e-08
ASC Car 0.000000 -3.642717e-08
ASC Train -0.000119 -1.286828e-07
B_COST 0.000000 1.057123e-07
B_TIME 0.000000 -1.959603e-07

From the dataframe above, we see that the numeric gradient does not adequately approximate the analytic gradient. The numeric gradient is incorrect by a factor of 2 to 4 depending on what variable is being examined (this excludes the Future Modes Nest Param that constrained to 40).

Given that the numeric gradient does not provide a good approximation of the analytic gradient, I do not expect the numeric hessian (nor the BHHH approximation used by PyLogit) to provide a good approximation to the analytic hessian. Since the standard errors calculated from the numeric hessian matches PythonBioeme's standard errors, this suggests that Python Biogeme is calculating its hessian (and therefore standard errors) via numeric differentiation, and as suggested above, this is likely to be poor approximation to the analytic hessian.


In [ ]: